24 research outputs found

    The Effect of Distinct Geometric Semantic Crossover Operators in Regression Problems

    This paper investigates the impact of geometric semantic crossover operators on a wide range of symbolic regression problems. First, it analyses the impact of using Manhattan and Euclidean distance geometric semantic crossovers in the learning process. Then, it proposes two strategies to numerically optimize the crossover mask based on mathematical properties of these operators, instead of simply generating it randomly. An experimental analysis comparing geometric semantic crossovers using Euclidean and Manhattan distances and the proposed strategies is performed on a test bed of twenty datasets. The results show that the use of different distance functions in the geometric semantic crossover has little impact on the test error, and that our optimized crossover masks yield slightly better results. For SGP practitioners, we suggest the use of the semantic crossover based on the Euclidean distance, as it achieved results similar to those obtained by the more complex operators.
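
    A minimal sketch (Python/NumPy, assumed here rather than taken from the paper) of how such crossovers act on semantic vectors, i.e. the vectors of a program's outputs on the training cases: sgx_random blends two parents with a random constant mask, while sgx_optimized illustrates one way a constant mask could be optimized in closed form against the target semantics. The function names and the closed-form rule are illustrative assumptions, not the exact strategies evaluated in the paper.

```python
import numpy as np

def sgx_random(s1, s2, rng):
    """Geometric semantic crossover on semantic vectors with a random
    constant mask in [0, 1] blending the two parent semantics."""
    r = rng.uniform(0.0, 1.0)
    return r * s1 + (1.0 - r) * s2

def sgx_optimized(s1, s2, target):
    """Illustrative 'optimized mask': the constant r in [0, 1] that brings the
    offspring semantics closest, in Euclidean distance, to the target; it is
    the least-squares solution for r, clipped to the valid interval."""
    d = s1 - s2
    denom = float(np.dot(d, d))
    r = 0.5 if denom == 0.0 else float(np.dot(target - s2, d)) / denom
    r = min(max(r, 0.0), 1.0)
    return r * s1 + (1.0 - r) * s2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s1, s2 = rng.normal(size=20), rng.normal(size=20)  # parent semantics
    t = rng.normal(size=20)                            # target output vector
    print(np.linalg.norm(sgx_random(s1, s2, rng) - t),
          np.linalg.norm(sgx_optimized(s1, s2, t) - t))
```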

    Enhancement of Epidemiological Models for Dengue Fever Based on Twitter Data

    Epidemiological early-warning systems for dengue fever rely on up-to-date epidemiological data to forecast future incidence. However, epidemiological data typically take time to become available, due to the application of time-consuming laboratory tests. This implies that epidemiological models need to issue predictions further in advance, making their task even more difficult. On the other hand, online platforms such as Twitter or Google allow us to obtain samples of users' interactions in near real time and can be used as sensors to monitor current incidence. In this work, we propose a framework that exploits online data sources to mitigate the lack of up-to-date epidemiological data by obtaining estimates of current incidence, which are then used by traditional epidemiological models. We show that the proposed framework obtains more accurate predictions than alternative approaches, with statistically better results for delays greater than or equal to 4 weeks. Comment: ACM Digital Health 201
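
    The abstract describes a two-stage architecture: a nowcasting step estimates current incidence from online activity, and those estimates feed a conventional forecasting model. The sketch below (Python/NumPy, synthetic data) conveys the idea with deliberately simple stand-ins: an ordinary-least-squares nowcaster from weekly tweet counts and an AR(1) forecaster in place of the paper's epidemiological models. All names, the 4-week delay and the data are assumptions for illustration only.

```python
import numpy as np

def fit_linear(x, y):
    """Ordinary least squares y ~ a*x + b."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # (a, b)

def nowcast(coef, tweets):
    """Estimate current incidence from the online signal."""
    a, b = coef
    return a * tweets + b

def ar1_forecast(series, horizon):
    """A simple AR(1) stand-in for the epidemiological forecasting model."""
    phi, c = fit_linear(series[:-1], series[1:])
    preds, last = [], series[-1]
    for _ in range(horizon):
        last = phi * last + c
        preds.append(last)
    return np.array(preds)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    weeks = 60
    incidence = np.abs(np.cumsum(rng.normal(size=weeks))) + 10.0   # synthetic case counts
    tweets = 3.0 * incidence + rng.normal(scale=5.0, size=weeks)   # correlated online signal
    delay = 4                                                      # weeks of missing case data
    coef = fit_linear(tweets[:-delay], incidence[:-delay])         # train the nowcaster
    filled = np.concatenate([incidence[:-delay], nowcast(coef, tweets[-delay:])])
    print(ar1_forecast(filled, horizon=2))                         # forecast from the filled series
```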

    Revisiting the Sequential Symbolic Regression Genetic Programming

    Sequential Symbolic Regression (SSR) is a technique that recursively induces functions over the error of the current solution, concatenating them in an attempt to reduce the error of the resulting model. As a proof of concept, the method was previously evaluated on one-dimensional problems and compared with canonical Genetic Programming (GP) and Geometric Semantic Genetic Programming (GSGP). In this paper we revisit SSR, exploring the method's behaviour on higher-dimensional, larger and more heterogeneous datasets. We discuss the difficulties arising from the application of the method to more complex problems, e.g., overfitting, along with suggestions to overcome them. An experimental analysis was conducted comparing SSR to GP and GSGP, showing that SSR solutions are smaller than those generated by GSGP, with similar performance, and more accurate than those generated by canonical GP.
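
    The core loop described above, fitting each new function to the error left by the functions already found and accumulating the results, can be sketched as follows (Python/NumPy). The single-feature least-squares step is a deliberately trivial stand-in for the symbolic-regression run that SSR performs at each iteration; function names and parameters are illustrative assumptions.

```python
import numpy as np

def induce_component(X, residual, rng):
    """Stand-in for one symbolic-regression run: fit one randomly chosen
    feature to the current residual by least squares. SSR would instead
    evolve a GP expression for this step."""
    j = int(rng.integers(X.shape[1]))
    A = np.column_stack([X[:, j], np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, residual, rcond=None)
    return lambda Z, j=j, w=w: w[0] * Z[:, j] + w[1]

def ssr_like_fit(X, y, iterations, rng):
    """Sequentially induce components over the current error and accumulate
    (sum) them into the final model."""
    components, pred = [], np.zeros(len(y))
    for _ in range(iterations):
        f = induce_component(X, y - pred, rng)
        components.append(f)
        pred = pred + f(X)
    return lambda Z: sum(f(Z) for f in components)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 5))
    y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=200)
    model = ssr_like_fit(X, y, iterations=10, rng=rng)
    print(float(np.mean((model(X) - y) ** 2)))  # training MSE
```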

    Reducing Dimensionality to Improve Search in Semantic Genetic Programming

    Genetic programming approaches are moving from analysing the syntax of individual solutions to looking into their semantics. One common definition of the semantic space in the context of symbolic regression is an n-dimensional space, where n corresponds to the number of training examples. In problems where this number is high, the search process can become harder as the number of dimensions increases. Geometric semantic genetic programming (GSGP) explores the semantic space by performing geometric semantic operations, and the fitness landscape seen by GSGP is guaranteed to be conic by construction. Intuitively, a lower number of dimensions can make search more feasible in this scenario, decreasing the chances of overfitting the data and reducing the number of evaluations required to find a suitable solution. This paper proposes two approaches for dimensionality reduction in GSGP: (i) applying existing instance selection methods as a pre-processing step before the training points are given to GSGP; (ii) incorporating instance selection into the evolution of GSGP. Experiments on 15 datasets show that GSGP performance is improved by using instance reduction during the evolution.
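
    Approach (i) can be pictured as a filter that thins out the training cases, and therefore the dimensions of the semantic space, before GSGP ever runs. The sketch below (Python/NumPy) uses a simple distance-threshold filter as a stand-in for the instance selection methods actually evaluated in the paper; the function name and the threshold are illustrative assumptions.

```python
import numpy as np

def threshold_instance_selection(X, y, min_dist):
    """Keep a training case only if it lies at least `min_dist` away, in input
    space, from every case kept so far; a simple stand-in for the instance
    selection step applied before handing the data to GSGP."""
    kept = []
    for i in range(len(X)):
        if all(np.linalg.norm(X[i] - X[k]) >= min_dist for k in kept):
            kept.append(i)
    idx = np.array(kept)
    return X[idx], y[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 4))
    y = X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=500)
    Xr, yr = threshold_instance_selection(X, y, min_dist=1.5)
    # The reduced set defines a lower-dimensional semantic space for GSGP.
    print(len(X), "->", len(Xr), "training cases / semantic dimensions")
```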

    A Generic Framework for Building Dispersion Operators in the Semantic Space

    This chapter proposes a generic framework to build geometric dispersion (GD) operators for Geometric Semantic Genetic Programming (GSGP) in the context of symbolic regression, followed by two concrete instantiations of the framework: a multiplicative geometric dispersion operator and an additive geometric dispersion operator. These operators move individuals in the semantic space in order to balance the population around the target output in each dimension, with the objective of expanding the convex hull defined by the population to include the desired output vector. An experimental analysis was conducted on a testbed composed of sixteen datasets, showing that dispersion operators can improve GSGP search and that the multiplicative version of the operator is overall better than the additive version.
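
    The following sketch (Python/NumPy) conveys the intended effect on the semantics of a single individual: find the training case (dimension) where the population is most concentrated on one side of the target, then rescale (multiplicative) or shift (additive) the individual so that it lands on the under-represented side. In GSGP these operators would act on the program itself rather than directly on its semantics, and the step heuristic and function names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def most_unbalanced_dimension(pop_semantics, target):
    """Return the dimension where the population is least balanced around the
    target, plus the per-dimension fraction of individuals above the target."""
    above = (pop_semantics > target).mean(axis=0)
    return int(np.argmax(np.abs(above - 0.5))), above

def multiplicative_dispersion(ind, pop_semantics, target, step=0.1):
    """Scale the whole semantic vector so that, in the most unbalanced
    dimension, the individual moves to the minority side of the target."""
    d, above = most_unbalanced_dimension(pop_semantics, target)
    sign = 1.0 if above[d] < 0.5 else -1.0
    desired = target[d] + sign * step * (abs(target[d]) + 1.0)
    if ind[d] == 0.0:
        return ind.copy()            # a zero output cannot be rescaled
    return (desired / ind[d]) * ind

def additive_dispersion(ind, pop_semantics, target, step=0.1):
    """Shift the whole semantic vector by a constant with the same goal."""
    d, above = most_unbalanced_dimension(pop_semantics, target)
    sign = 1.0 if above[d] < 0.5 else -1.0
    desired = target[d] + sign * step * (abs(target[d]) + 1.0)
    return ind + (desired - ind[d])

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    target = rng.normal(size=10)
    pop = target + np.abs(rng.normal(size=(20, 10)))   # population biased above the target
    print(multiplicative_dispersion(pop[0], pop, target)[:3])
    print(additive_dispersion(pop[0], pop, target)[:3])
```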

    An ant colony-based semi-supervised approach for learning classification rules

    Semi-supervised learning methods create models from a few labeled instances and a great number of unlabeled instances. They are a good option in scenarios where there is a lot of unlabeled data and the process of labeling instances is expensive, such as those in which most Web applications operate. This paper proposes a semi-supervised self-training algorithm called Ant-Labeler. Self-training algorithms take advantage of supervised learning algorithms to iteratively learn a model from the labeled instances and then use this model to classify unlabeled instances. The instances that receive labels with high confidence are moved from the unlabeled to the labeled set, and this process is repeated until a stopping criterion is met, such as labeling all unlabeled instances. Ant-Labeler uses an ant colony optimization (ACO) algorithm as the supervised learning method in the self-training procedure to generate interpretable rule-based models, which are used as an ensemble to ensure accurate predictions. The pheromone matrix is reused across different executions of the ACO algorithm to avoid rebuilding the models from scratch every time the labeled set is updated. Results showed that the proposed algorithm obtains better predictive accuracy than three state-of-the-art algorithms on roughly half of the datasets on which it was tested, and that the smaller the number of labeled instances, the better the Ant-Labeler performance.
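
    The self-training loop at the heart of Ant-Labeler can be sketched as below (Python/NumPy). A toy nearest-centroid classifier stands in for the ACO rule learner, so the sketch does not show rule-based models, the ensemble, or the pheromone-matrix reuse described above; the class and function names and the confidence threshold are assumptions for illustration.

```python
import numpy as np

class CentroidClassifier:
    """Toy stand-in for the ACO rule learner used by Ant-Labeler."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        inv = 1.0 / (d + 1e-9)
        return inv / inv.sum(axis=1, keepdims=True)

def self_training(Xl, yl, Xu, make_model, confidence=0.9, max_rounds=10):
    """Generic self-training: fit on the labeled set, move confidently labeled
    unlabeled instances into it, and repeat until nothing more is added."""
    Xl, yl, Xu = Xl.copy(), yl.copy(), Xu.copy()
    for _ in range(max_rounds):
        if len(Xu) == 0:
            break
        model = make_model().fit(Xl, yl)
        proba = model.predict_proba(Xu)
        conf, pred = proba.max(axis=1), model.classes_[proba.argmax(axis=1)]
        take = conf >= confidence
        if not take.any():
            break
        Xl = np.vstack([Xl, Xu[take]])
        yl = np.concatenate([yl, pred[take]])
        Xu = Xu[~take]
    return make_model().fit(Xl, yl)

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)), rng.normal(loc=2.0, size=(50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    labeled = np.concatenate([rng.choice(50, 5, replace=False), 50 + rng.choice(50, 5, replace=False)])
    mask = np.zeros(100, dtype=bool)
    mask[labeled] = True
    model = self_training(X[mask], y[mask], X[~mask], CentroidClassifier)
    print((model.predict_proba(X).argmax(axis=1) == y).mean())   # accuracy on all data
```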

    Multiobjective Genetic Algorithms for Attribute Selection

    Attribute selection is one of the tasks that can be performed during the preprocessing of the data to be mined. It is an important task because, in the majority of cases, data is collected for purposes other than classification. As a result, databases usually contain many irrelevant attributes, and if these attributes are not removed they can hinder the learning process. This work proposes a multiobjective Genetic Algorithm (GA) for attribute selection. Its development and implementation were motivated by the great success obtained by GAs in applications where the search space is vast, and by the advantage of performing a global search in the space of candidate solutions, unlike other algorithms based on local search. The proposed GA uses concepts of multiobjective optimization, since the attribute selection problem requires, in our case, the optimization of two objectives: the classification error and the number of rules generated by a rule induction algorithm. The evaluation of the individuals is performed according to the wrapper approach, i.e., the evaluation of each individual of the population involves running the classification algorithm to be used later (with the set of selected attributes), in order to make the attribute selection procedure more robust. The classification algorithm used in this work is C4.5. In addition to the multiobjective GA, this work also proposes a multiobjective version of the forward sequential selection method, in order to compare multiobjective versions of two methods often used in the attribute selection task. Experiments on 18 public-domain databases showed that the proposed multiobjective genetic algorithm and multiobjective forward sequential selection method can solve the attribute selection task better than the single-objective methods.
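
    The multiobjective core of the approach, comparing attribute subsets by Pareto dominance on (classification error, number of rules), can be sketched as below (Python/NumPy). A nearest-centroid error and the subset size stand in for the wrapper step, which in the paper runs C4.5 and counts the induced rules; the GA's variation operators are omitted, and all names and parameters are illustrative assumptions.

```python
import numpy as np

def dominates(a, b):
    """Pareto dominance for minimization of (error, number of rules)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def toy_wrapper_eval(mask, X, y):
    """Stand-in for the wrapper evaluation: a nearest-centroid error replaces
    the C4.5 error, and the subset size replaces the number of rules."""
    if not mask.any():
        return (1.0, 0)
    Xs = X[:, mask]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = (np.linalg.norm(Xs - c1, axis=1) < np.linalg.norm(Xs - c0, axis=1)).astype(int)
    return (float((pred != y).mean()), int(mask.sum()))

def pareto_front(masks, X, y):
    """Non-dominated attribute subsets under the two objectives."""
    scores = [toy_wrapper_eval(m, X, y) for m in masks]
    return [m for m, s in zip(masks, scores) if not any(dominates(t, s) for t in scores)]

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    X = rng.normal(size=(200, 8))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)           # only two relevant attributes
    masks = [rng.random(8) < 0.5 for _ in range(30)]  # candidate attribute subsets
    front = pareto_front(masks, X, y)
    print(len(front), "non-dominated subsets out of", len(masks))
```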